Project FMT

• DOMAIN: Semiconductor manufacturing process

• CONTEXT: A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system: the measured signals contain a combination of useful information, irrelevant information, and noise. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. Process engineers may then use these signals to determine the key factors contributing to yield excursions downstream in the process. This will enable increased process throughput, decreased time to learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing different combinations of features, the essential signals impacting the yield type can be identified.

• DATA DESCRIPTION: sensor-data.csv : (1567, 592). The data consists of 1567 datapoints, each with 591 features. The dataset presented in this case represents a selection of such features, where each example represents a single production entity with its associated measured features, and the label represents a simple pass/fail yield for in-house line testing. In the target column, "-1" corresponds to a pass and "1" corresponds to a fail, and the timestamp is for that specific test point.

• PROJECT OBJECTIVE: We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.

1. Import and understand the data

Q 1A. Import ‘signal-data.csv’ as a DataFrame

Ans 1A
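A minimal sketch of the import step. Since the real CSV is not available here, an in-memory stand-in is used via `io.StringIO`; the column names `f0`–`f2` and `Pass/Fail` are illustrative (the real file has 1567 rows and 592 columns).

```python
import io
import pandas as pd

# Stand-in for 'signal-data.csv'; in practice this would be
# df = pd.read_csv('signal-data.csv')
csv_text = """f0,f1,f2,Pass/Fail
0.1,0.2,0.3,-1
0.4,,0.6,1
0.7,0.8,0.9,-1
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # → (3, 4)
```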

Q 1B. Print the 5-point summary and share at least 2 observations

Ans 1B.
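The 5-point summary (min, 25th, 50th, 75th percentiles, max) is part of pandas' `describe()` output. A sketch on a small stand-in frame:

```python
import pandas as pd

# Toy stand-in for the sensor data
df = pd.DataFrame({"f0": [0.1, 0.4, 0.7, 1.0], "f1": [1.0, 2.0, 3.0, 4.0]})

summary = df.describe()
# Keep only the 5-point summary rows
print(summary.loc[["min", "25%", "50%", "75%", "max"]])
```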

Observations


2. Data cleansing:

Q 2A. Write a for loop which will remove all the features with 20%+ null values and impute the rest with the mean of the feature

Ans 2A
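A sketch of the drop-or-impute loop on a toy frame: column `a` is 30% null (dropped), `b` is 10% null (mean-imputed), `c` has no nulls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, 5, 6, 7, 8, 9, 10],
    "b": [1.0, 2, 3, 4, np.nan, 6, 7, 8, 9, 10],
    "c": list(range(10)),
})

for col in list(df.columns):          # list() so we can drop while iterating
    null_frac = df[col].isnull().mean()
    if null_frac >= 0.20:
        df = df.drop(columns=col)     # too sparse to be useful
    else:
        df[col] = df[col].fillna(df[col].mean())

print(df.columns.tolist())  # → ['b', 'c']
```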

Q 2B. Identify and drop the features which have the same value for all the rows.

Ans 2B.
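Columns with a single unique value carry no information for the model. A sketch using `nunique()` on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"const": [7, 7, 7, 7], "varies": [1, 2, 3, 4]})

# dropna=False so a column of all-NaN also counts as constant
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
```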

Q 2C. Drop other features if required using relevant functional knowledge. Clearly justify the same.

Ans 2C

Some of the columns have the same value in 98% of their rows; their variance is too low to influence the prediction and accuracy. These quasi-constant columns are identified and dropped.
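One way to identify these quasi-constant columns is to check the relative frequency of the most common value in each column; the 0.98 threshold here mirrors the justification above.

```python
import pandas as pd

# Toy frame: 'quasi' has the same value in 99% of rows
df = pd.DataFrame({"quasi": [0] * 99 + [1], "ok": list(range(100))})

# Fraction of rows taken by the most frequent value in each column
top_freq = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])
quasi_cols = top_freq[top_freq >= 0.98].index.tolist()
df = df.drop(columns=quasi_cols)
```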

Q 2D. Check for multi-collinearity in the data and take necessary action.

Ans 2D. Compute the VIF (variance inflation factor) for each feature to detect multi-collinearity, and remove features with high VIF values.
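A sketch of the VIF check on synthetic data (the feature names `x1`–`x3` are illustrative). It uses the identity that, for standardised data, the VIFs are the diagonal of the inverse correlation matrix, i.e. 1/(1 − R²) per feature; `statsmodels.stats.outliers_influence.variance_inflation_factor` is a common library alternative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF = diagonal of the inverse correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(X.corr().values)), index=X.columns)
high_vif = vif[vif > 10].index.tolist()  # common rule of thumb: VIF > 10
```

Note that only one feature of each collinear pair should be dropped (and the VIFs recomputed), not all high-VIF features at once.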

Q 2E. Make all relevant modifications to the data using functional/logical reasoning and assumptions.

Ans 2E


3. Data analysis & visualisation

Q 3A. Perform a detailed univariate analysis with appropriate detailed comments after each analysis

Ans 3A.

The distribution of feature '75' is close to a Gaussian/normal distribution. Most of the observations lie between -0.35 and 0.25, i.e. classes 3 and 4 hold most of the values. The median, mean, and mode are all placed very close together in the distribution.

Q 3B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis.

Ans 3B.

Very little correlation exists between the features and the target variable. The data is imbalanced, as far more data is available for the failure class.
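The two observations above (weak feature-target correlation, class imbalance) can be checked numerically. A sketch on a synthetic stand-in with uninformative features and a ~90/10 class split:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
# Imbalanced binary target, mirroring the observation above
df["Pass/Fail"] = np.where(rng.random(200) < 0.9, -1, 1)

# Correlation of each feature with the target
corr_with_target = df.drop(columns="Pass/Fail").corrwith(df["Pass/Fail"])
# Class proportions reveal the imbalance
class_counts = df["Pass/Fail"].value_counts(normalize=True)
print(corr_with_target)
print(class_counts)
```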


4. Data pre-processing:

Q 4A. Segregate predictors vs target attributes.

Ans 4A.
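Segregating predictors from the target is a one-line split on the target column (the column name `Pass/Fail` follows the data description above):

```python
import pandas as pd

# Toy stand-in for the cleaned sensor data
df = pd.DataFrame({"f0": [0.1, 0.4], "f1": [0.2, 0.5], "Pass/Fail": [-1, 1]})

X = df.drop(columns="Pass/Fail")  # predictors
y = df["Pass/Fail"]               # target
```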

Q 4B. Check for target balancing and fix it if found imbalanced

Ans 4B

Pass (value 1 is considered as Pass here) makes up less than 10% of the target column, so the target is imbalanced. The target value of alternate rows is changed to 1, and to enable easy classification -1 is changed to 0.
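A sketch of the relabelling plus one simple balancing fix: naive random oversampling of the minority class with pandas (SMOTE from the `imbalanced-learn` library is a common alternative). The ~8% minority share mirrors the observation above; the data itself is synthetic.

```python
import pandas as pd

df = pd.DataFrame({"f0": range(100),
                   "target": [1] * 8 + [-1] * 92})  # ~8% minority class
df["target"] = df["target"].replace(-1, 0)          # relabel -1 -> 0

counts = df["target"].value_counts()
minority = counts.idxmin()
# Resample minority rows with replacement until the classes match
extra = df[df["target"] == minority].sample(
    n=counts.max() - counts.min(), replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)
```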

Q 4C. Perform train-test split and standardise the data, or vice versa if required.

Ans 4C
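A sketch of the split-then-scale order on synthetic data: the scaler is fitted on the training set only and then applied to both splits, so no test-set information leaks into the scaling parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # synthetic features
y = rng.integers(0, 2, size=100)                   # synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

scaler = StandardScaler().fit(X_train)  # fit on train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```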

Q 4D. Check if the train and test data have similar statistical characteristics when compared with original data.

Ans 4D.

The correlations between the train, test, and original data are similar.
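One simple way to check statistical similarity is to compare per-feature summary statistics across the splits; for a random split the means should be close. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
train, test = train_test_split(df, test_size=0.25, random_state=42)

# Largest per-feature gap in means between the splits
gap = (train.mean() - test.mean()).abs().max()
print(gap)
```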


5. Model training, testing and tuning:

Q 5A. Use any Supervised Learning technique to train a model.

Ans 5A. A Decision Tree classifier is used.
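A sketch of training the Decision Tree; `make_classification` stands in for the preprocessed sensor data, and `max_depth=5` is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
```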

Q 5B. Use cross validation techniques.

Ans 5B.
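A sketch of 5-fold stratified cross-validation on the same synthetic stand-in; stratification keeps the class ratio consistent across folds, which matters for imbalanced targets like this one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())
```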

Q 5C. Apply hyper-parameter tuning techniques to get the best accuracy.

Ans 5C
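A sketch of hyper-parameter tuning via grid search; the parameter grid (`max_depth`, `min_samples_leaf`) is an illustrative choice for a Decision Tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```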

Q 5D. Use any other technique/method which can enhance the model performance.

Ans 5D
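One option for enhancing performance on a wide dataset like this is dimensionality reduction; a sketch of a scale → PCA → classifier pipeline (the component count of 10 is an illustrative assumption, not a tuned value).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),   # keep the 10 strongest components
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
```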

Q 5E. Display and explain the classification report in detail.

Ans 5E
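The classification report lists per-class precision, recall, F1-score, and support, which is more informative than plain accuracy on an imbalanced target. A sketch on hand-made labels:

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

report = classification_report(y_true, y_pred)
print(report)
```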

Q 5F. Apply the above steps for all possible models that you have learnt so far.

Ans 5F. Logistic Regression, KNN, Random Forest, Decision Tree, and SVM models are used to validate the regular dataset, the over-sampled dataset, and the under-sampled dataset.
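A sketch of the model loop over the five listed classifiers, recording train and test accuracy for each (on synthetic stand-in data; the over/under-sampled variants would reuse the same loop with the resampled splits):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train),
                     model.score(X_test, y_test))
    print(name, results[name])
```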


6. Post Training and Conclusion:

Q 6A. Display and compare all the models designed with their train and test accuracies.

Ans 6A.

Model                  Dataset         Train score   Test score
Logistic Regression    Normal          0.56          0.51
Logistic Regression    Over-sampled    0.56          0.49
Logistic Regression    Under-sampled   0.56          0.48
KNN                    Normal          0.56          0.49
KNN                    Over-sampled    0.56          0.49
KNN                    Under-sampled   0.56          0.49
Random Forest          Normal          1.00          0.51
Random Forest          Over-sampled    1.00          0.50
Random Forest          Under-sampled   1.00          0.50
Decision Tree          Normal          1.00          0.51
Decision Tree          Over-sampled    1.00          0.49
Decision Tree          Under-sampled   1.00          0.52
SVM                    Normal          0.54          0.51
SVM                    Over-sampled    0.57          0.49
SVM                    Under-sampled   0.56          0.48

Q 6B. Select the final best trained model along with your detailed comments for selecting this model.

Ans 6B.

Q 6C. Pickle the selected model for future use.

Ans 6C.
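A sketch of pickling a trained model; `pickle.dumps`/`loads` is used here for a self-contained round-trip, while `pickle.dump(model, open('model.pkl', 'wb'))` would write to disk in the same way.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

blob = pickle.dumps(model)        # serialise the trained model
restored = pickle.loads(blob)     # deserialise it for future use
```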

Q 6D. Write your conclusion on the results.

Ans 6D.